In this session you will calculate probabilities, quantiles and confidence intervals using the normal distribution.
By actively following the materials and carrying out the independent study before and after the contact hours the successful student will be able to:
Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the independent study materials for reference.
Materials are indexed here: https://3mmarand.github.io/BIO00017C-Data-Analysis-in-R-2020/
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
is something you need to do on your computer. It may be opening programs or documents or locating a file.
is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
is question for you to think about an answer. You will usually want to record your answers in your script for future reference.
Artwork by Allison Horst
Top Tip
Did you set up a folder called ‘data-analysis-in-r’ (or similar) as directed in the first workshop? If not, you might want to do that now. Workshop 1
Start RStudio from the Start menu.
Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project and then New Directory, then New Project. Navigate to the “data-analysis-in-r” folder. Name the RStudio Project ‘workshop3’.
On the Files tab click on New Folder. In the box that appears type “data”. This will be the folder that we save data files to.
Make a new script then save it with a name like analysis.R to carry out the rest of the work.
Load the
tidyverse:
library(tidyverse)For any distribution, two very useful quantities can be calculated:
- the Distribution Function, which gives the probability that a variable takes a particular value or less.
- the Quantile function which is the inverse of the Distribution function, i.e., it returns the value (‘quantile’) for a given probability.
The functions are named with a letter p or q preceding the distribution name. Below are some examples:
| Probability | Quantile | |
|---|---|---|
| Binomial distribution | pbinom() | qbinom() |
| Normal distribution | pnorm() | qnorm() |
| t distribution | pt() | qt() |
pnorm() Look up
pnorm() in the manual using ?pnorm
The normal distribution functions are documented together. The default normal distribution has a mean of 0 and a standard deviation of 1. This is known as the standard normal distribution. You give pnorm() a value and it gives you the probability of getting that value or less from a normal distribution with a mean of 0 and a standard deviation of 1.
For example, the probability of getting a value of 1, \(P[x < 1]\) , or less from this distribution is:
pnorm(1)## [1] 0.8413447
There are two ways to find the probability of getting 1 or more, \(P[x > 1]\).
1. By subtracting \(P[x < 1]\) from 1:
1 - pnorm(1)## [1] 0.1586553
2. By using the
lower.tail argument:
pnorm(1, lower.tail = FALSE)## [1] 0.1586553
Read the manual for
pnorm()! Notice that the lower.tail is set to TRUE as a default. If you do not give the argument, TRUE is the value it will take and is why you get the probability of getting a particular value or less. If we set lower.tail = FALSE, we get the upper tail and thus the probability of a particular value or more.
If you want the probability of a value from a different normal distribution you need to set the mean and standard deviation because mean = 0 and sd = 1 are the defaults.
I.Q. in the U.K. population is normally distributed with a mean of 100 and a standard deviation of 15. We will work with this distribution over the next few exercises.
We can use pnorm() to calculate probabilities associated with having a particular range of IQs.
We can use the values of mean = 100 and sd = 15 in pnorm() to work out the probability of having an I.Q. of 115 or less.
First, create variables for the parameter values:
# create variables for the parameter values
# we do this so we don't have to keep typing the values
# and if we want to use the same code on a different example,
# we only have to change two lines of code. This is good practice
m <- 100
sd <- 15 Now pass those variables to the
pnorm() function along with the value for which we want a probability:
# Now use pnorm()
pnorm(115, m, sd)## [1] 0.8413447
Because the default is lower.tail = TRUE, we get the probability we want, \(P[IQ < 115]\)
Top Tip
Sketch the distribution on paper and shade the area you want. This will help you to work out values to give the function.
Determine the probability of having an IQ of 115 OR MORE? Do a sketch first.
Determine the probability of having an IQ between 85 and 115? Do a sketch first.
Is this what you expect?
What is 1.96 \(\times\) the standard deviation?
What is the probability of having an IQ between -1.96 standard deviations and +1.96 standard deviations? Is this what you expect?
We can use qnorm() to find the IQ associated with a particular probability.
We will again use the values of mean = 100 and standard deviation = 15 in qnorm() to work out what I.Q. value 0.2 (20%) of people fall below. Make sure you relate the manual information to the command.
To find the I.Q. value that 20% people fall below:
qnorm(0.2, m, sd)## [1] 87.37568
20% people have an IQ less than 87.4
What I.Q. value are 0.025 (2.5%) of people below?
In what range do 99% of the population fall? Note that 99% means 1% (0.01) in both tails so 0.5% (0.005) in each tail. The figure may help you.
The only difference in using pnorm() and qnorm() for samples is in what we give as the sd argument. Since we are now thinking about the distribution of the sample means, we need to use the standard error.
We used mean = 100 and standard deviation = 15 in pnorm() to work out the probability of an individual having an I.Q. of 115 or less.
We can use a similar approach to find the probability of getting a sample of n = 5 having a mean I.Q. of 115 or less The only difference is that we use the standard error instead of the standard deviation.
First, calculate the standard error:
n <- 5
se <- 15 / sqrt(n) Now the probability of getting a sample mean of 115 or less from that distribution:
pnorm(115, m, se)## [1] 0.9873263
There’s a 0.9873 probability that a sample of 5 people will have a mean of 115 or less. Thus there is a probability of just 0.0127 that a sample of n = 5 will have a mean above 115. This is quite unlikely and we might suspect this group was not sampled from the general population.
What is the probability of sample of size 10 having a mean of 105 or more?
The data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by: \(\bar{x} \pm 1.96 \times s.e.\))
Where 1.96 is the quantile for 95% confidence.
You may need to refer to previous practicals to remind yourself how to carry out some of the following steps.
Save a copy of the file to your ‘data’ directory
Read in the data and check the structure of the resulting dataframe
Calculate and assign to variables: the mean, standard deviation and standard error
To calculate the 95% confidence interval we need to look up the quantile (multiplier) using
qnorm()
q <- qnorm(0.975) Now we can use it in our confidence interval calculation
lcl <- m - q * se
ucl <- m + q * se Print the values
lcl## [1] 4.473176
ucl## [1] 4.626824
Between what values would you be 99% confident of the population mean being?
The confidence interval for small samples is given by: \(\bar{x} \pm \sf t_{[d.f]} \times s.e.\)
The fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \(\mu m^2\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt
Save a copy of the file. I saved mine to my ‘data’ directory
Read in the data and check the structure of the resulting dataframe
Assign the mean to
m
Calculate and assign the standard error to
se
To work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).
We can assign this to a variable,
df, using:
df <- length(neur$csa) - 1 The t value is found by:
t <- qt(0.975, df = df)Note that we are using qt() rather than qnorm() but that the probability used is the same. Finally, we need to put our mean, standard error and t value in the equation. \(\bar{x} \pm \sf t_{[d.f]} \times s.e.\).
The upper confidence limit is:
(m + t * se) %>% round(2)## [1] 151.95
The first part of the command, (m + t * se) calculates the upper limit. This is ‘piped’ in to the round() function to round the result to two decimal places.
Calculate the lower confidence limit:
Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?